Abstract

We study a simple learning model based on the Hebb rule to cope with "delayed", unspecific
reinforcement. In spite of the unspecific nature of the information-feedback, convergence
to asymptotically perfect generalization is observed, with a rate depending, however, in a
non-universal way on learning parameters. Asymptotic convergence can be as fast as that
of Hebbian learning, but may be slower. Moreover, for a certain range of parameter settings,
it depends on initial conditions whether the system can reach the regime of asymptotically
perfect generalization, or rather approaches a stationary state of poor generalization.

1 Introduction

Introducing biologically motivated features in models for learning usually has a double role:
testing hypotheses for natural learning and finding hints for artificial learning. These problems
can be stated at various levels of sophistication. Here we do not take the more ambitious point of
view of describing the complexity of the former or of finding optimal algorithms for the latter.
On the contrary, our motivation is to investigate the capabilities of very elementary
mechanisms.

One urgent problem with which a system, either natural or artificial, may be confronted
when trying to improve its performance is to learn only from the final success/failure of a series of
consecutive decisions. The typical situation we may consider is that of an "agent" which, let free
in a complicated "landscape", tries many "paths" to reach a "goal" and has to optimize its path (a
local problem) knowing only the "time" (or cost) it needs to reach the goal (global information).
Here "goal" may be a survival interest or the solution of a problem, "path" a series of moves or of
partial solution steps in a complex geographical or mathematical "landscape", etc. The problem
we want to approach here is to find out whether there are elementary features characterizing
learning under such unspecific reinforcement conditions. From the point of view of reinforcement
learning our problem may be seen under the "class III" problems in the classification of Hertz et
al. [1]. However, we stress that our attitude is not that of finding good algorithms for tackling
special problems, like movement, control or games - see, e.g., [?]. For this reason we do not
consider evolved algorithms from the class of Q-learning [?], of TD learning [?], agent and critic
[?], etc. but restrict ourselves to the most primitive algorithms, which we may think of as having
a chance to have developed under natural conditions. On the other hand, if such algorithms prove
capable of tackling the problem they may well give further insights.^1

In the case of neural network systems the normal situation is already that of lacking detailed
control over the synapses, and learning is achieved by confronting the "pupil" system with the
correct answer after each presentation of a pattern. For perceptrons both the unsupervised Hebb
rule and the supervised perceptron algorithm are known to lead to asymptotically perfect
generalization, although with different asymptotic laws. In our problem setting, however, the pupil
never knows the right answer to each question, but only the average error it makes over many
tests. In previous work concerned with this problem [?] (see also [?]) we presented an analysis
of a 2-step algorithm based on the Hebb rule for perceptrons and used computer simulations
and a rough approximation to estimate the convergence conditions. In the present work we
undertake a detailed study of this learning algorithm, which we call for simplicity
"association-reinforcement (AR)-Hebb-rule". This algorithm introduces two learning parameters and
we find that its generalization behaviour is highly nontrivial: in the pre-asymptotic region, and
depending on the network parameters, fixed points of the learning dynamics may appear. This leads
either to asymptotically perfect generalization with non-universal power laws depending on the
(ratio of the) learning parameters, or to stationary states of very poor generalization, according
to the network parameters and initial conditions.

That this AR-Hebb-algorithm may be of a more general interest is suggested by applying it
to a concrete problem of optimizing paths in a landscape with obstacles and traps, in a neural
network recasting of [?]; this study will be presented elsewhere (partial results have been given
in [?]).

In the next section we shall introduce the problem and the algorithm, and in section 3 we 
shall present results from numerical simulations. In section 4 we shall study a coarse grained 
approximation which is appropriate for large networks ("thermodynamic limit"). Section 5 is 
reserved for conclusions. 

2 Learning rule for perceptrons under unspecific reinforcement 

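We consider perceptrons with Ising units $s, s_i = \pm 1$ and real weights (synapses) $J_i$:

$$s = \mathrm{sign}\left(\frac{1}{\sqrt{N}} \sum_{i=1}^{N} J_i s_i\right) = \mathrm{sign}\left(\frac{1}{\sqrt{N}}\, J \cdot s\right) \qquad (1)$$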



Here N is the number of input nodes, and we put no explicit thresholds. The network (pupil)
is presented with a series of patterns $\xi^{(q,l)}$, $q \in \mathbb{N}$, $l = 1, \ldots, L$, to which it answers with
$s^{(q,l)}$. A training period consists of the successive presentation of L patterns. The answers are
compared with the corresponding answers $t^{(q,l)}$ of a teacher with pre-given weights $B_i$ and the
average error made by the pupil over one training period is calculated:

$$e_q = \frac{1}{2L} \sum_{l=1}^{L} \left| t^{(q,l)} - s^{(q,l)} \right| . \qquad (2)$$

^1 An illustration of the problem was provided in an early paper [?] dealing with these questions
in the simulation of a device moving on a board.

The training algorithm consists of two parts:

I. - a "blind" Hebb-type association at each presentation of a pattern:

$$J_i^{(q,l+1)} = J_i^{(q,l)} + \frac{a_1}{\sqrt{N}}\, s^{(q,l)} \xi_i^{(q,l)} . \qquad (3)$$

II. - an "unspecific" but graded reinforcement proportional to the average error $e_q$ introduced
in (2), also Hebbian, at the end of each training period,

$$J_i^{(q+1,1)} = J_i^{(q,L+1)} - e_q\, \frac{a_2}{\sqrt{N}} \sum_{l=1}^{L} s^{(q,l)} \xi_i^{(q,l)} . \qquad (4)$$

Because of these 2 steps we call this algorithm "association/reinforcement (AR)-Hebb-rule" (or
"2-Hebb-rule" [10]). We are interested in the behavior with the number of iterations q of the
generalization error $\epsilon_g(q)$:

$$\epsilon_g(q) = \frac{1}{\pi} \arccos\left(\frac{J \cdot B}{|J|\,|B|}\right) \qquad (5)$$

The training patterns $\xi^{(q,l)}$ are generated randomly, and are taken to be unbiased in the present
paper. The case of structured patterns is more complicated, and will be dealt with in a separate
publication [?]. We shall test whether the behavior of $\epsilon_g(q)$ follows a power law at large q:

$$\epsilon_g(q) \simeq \mathrm{const} \cdot q^{-p} \qquad (6)$$

Notice the following features: 

a) During training the pupil only uses its own associations $\xi^{(q,l)} \leftrightarrow s^{(q,l)}$ and the average error
$e_q$, which does not refer specifically to the particular steps l.

b) Since the answers $s^{(q,l)}$ are made on the basis of the instantaneous weight values $J^{(q,l)}$,
which change at each step according to eq. (3), the series of answers form a correlated
sequence with each step depending on the previous one. Therefore $e_q$ measures in fact the
performance of a "path", an interdependent set of decisions.

c) For L = 1 the algorithm reduces to the usual "perceptron rule" (for $a_1 = 0$) or to the usual
"unsupervised Hebb rule" (for $a_2 = 2a_1$) for on-line learning, for which the corresponding
asymptotic behavior is known [10], [11].
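The two-step rule of eqs. (1)-(5) can be illustrated by the following minimal Python sketch. It is not the code used for the simulations reported in this paper; the function names (`generalization_error`, `ar_hebb_period`, `train`), the use of NumPy, the Gaussian initial weights and the tie-breaking convention sign(0) = +1 are assumptions made for illustration only.

```python
import numpy as np


def generalization_error(J, B):
    """Eq. (5): angle between pupil and teacher weight vectors, in units of pi."""
    cos_angle = (J @ B) / (np.linalg.norm(J) * np.linalg.norm(B))
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)) / np.pi)


def ar_hebb_period(J, B, a1, a2, L, rng):
    """Run one training period of L patterns and return (new J, e_q)."""
    N = J.size
    J = J.copy()
    hebb_sum = np.zeros(N)  # running sum of s * xi over the period
    n_errors = 0
    for _ in range(L):
        xi = rng.choice([-1.0, 1.0], size=N)   # unbiased random pattern
        s = 1.0 if J @ xi >= 0 else -1.0       # pupil answer, eq. (1)
        t = 1.0 if B @ xi >= 0 else -1.0       # teacher answer, only enters e_q
        n_errors += int(s != t)
        J += a1 / np.sqrt(N) * s * xi          # blind association, eq. (3)
        hebb_sum += s * xi
    e_q = n_errors / L                         # average error over the period
    J -= e_q * a2 / np.sqrt(N) * hebb_sum      # graded reinforcement, eq. (4)
    return J, e_q


def train(N, L, a1, a2, periods, seed=0):
    """Train from random initial conditions with B^2/N = J^2/N = 1."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal(N)
    B *= np.sqrt(N) / np.linalg.norm(B)
    J = rng.standard_normal(N)
    J *= np.sqrt(N) / np.linalg.norm(J)
    for _ in range(periods):
        J, _ = ar_hebb_period(J, B, a1, a2, L, rng)
    return J, B
```

Note that the teacher answer t is used only to count errors; the weight changes themselves involve the pupil's own answers s and the period average e_q, as stressed in feature a). For L = 1 and a2 = 2*a1 each period amounts to a single Hebb step of length a1 in the direction favoured by the teacher, in line with feature c).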



3 Numerical results 

In a preliminary analysis [?] we have tested various combinations of L = 1, 5, 10, 15 and N =
50, 100, 200, 300. We went with q up to 4.10^. We found the convergence of the learning procedure
to depend on the ratio $a_1/a_2$; in particular, no convergence was found for L of 5 and higher if
this ratio was decreased significantly below 0.2. For fixed $a_1, a_2$ the asymptotic behavior with
q appeared well reproduced by a power law and the exponent was found to depend on L. For
L = 1, varying $a_1/a_2$ between 0 and 1/2 interpolates between perceptron and Hebbian learning;
for ratios larger than 1 new asymptotic behavior can be expected to show up (see sect. 4). We
did not perform a systematic numerical analysis for L = 1, however.
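One simple way of extracting a power-law exponent from a measured error curve is a straight-line fit in log-log coordinates. The sketch below is an illustration only and not necessarily the procedure used for the results quoted here; the helper name `fit_power_law` and the cut-off parameter `q_min` (which discards the pre-asymptotic part of the curve) are assumptions.

```python
import numpy as np


def fit_power_law(q, eps, q_min):
    """Fit eps ~ c * q**(-p) for q >= q_min; return (p, c)."""
    q = np.asarray(q, dtype=float)
    eps = np.asarray(eps, dtype=float)
    mask = q >= q_min
    # Straight-line fit: log eps = log c - p * log q
    slope, intercept = np.polyfit(np.log(q[mask]), np.log(eps[mask]), 1)
    return float(-slope), float(np.exp(intercept))


# Synthetic check on an exact power law with exponent 1/2.
q = np.arange(1, 1001)
p_fit, c_fit = fit_power_law(q, 0.4 * q ** -0.5, q_min=10)
```

On noisy simulation data the fitted exponent depends on the choice of `q_min`, so it is advisable to check that the result is stable when the cut-off is moved.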
Figure 1: Generalization error $\epsilon_g$ vs. $\alpha$ for N = 100, L = 5 and L = 10 (upper plots) and
N = 300, L = 5 and L = 10 (lower plots), for the algorithms (a - d) of eqs. (9)-(12). The
lines indicate the expected asymptotic behaviour as suggested by the coarse grained approximation
discussed in section 4 for the corresponding $a_1/a_2$ ratio, as well as 2 further power laws for
illustration.



In the present, more precise analysis we use L = 5, 10 and N = 100, 300, going up to 8.10^
iterations. We introduce:

$$\alpha = qL/N . \qquad (7)$$

We present here results for the following choices of parameters:

$$a_2 = 0.012, \qquad (8)$$

and

(a) $a_1 = a_2/20$; (9)
(b) $a_1 = a_2/5$; (10)
(c) $a_1 = a_2/5$ for $\alpha < 100L$, $a_1 = a_2/(2L)$ for $\alpha > 100L$; (11)
(d) $a_1 = a_2/5$ for $\alpha < 100L$, $a_1 = 0$ for $\alpha > 100L$. (12)
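For use in a simulation, the four parameter choices (a)-(d) can be written as a single schedule for $a_1$ as a function of $\alpha$. The following is a minimal sketch; the function name `a1_schedule` and the convention that the switch happens for $\alpha$ strictly larger than 100L (the value at exactly $\alpha$ = 100L is not specified in the text) are assumptions.

```python
def a1_schedule(case, alpha, L, a2=0.012):
    """Return a1 for parameter choice 'a', 'b', 'c' or 'd' at rescaled time alpha."""
    switched = alpha > 100 * L  # assumption: boundary value belongs to the first regime
    if case == "a":
        return a2 / 20
    if case == "b":
        return a2 / 5
    if case == "c":
        return a2 / (2 * L) if switched else a2 / 5
    if case == "d":
        return 0.0 if switched else a2 / 5
    raise ValueError(f"unknown case: {case!r}")


# Example: case (c) with L = 5 before and after the switch at alpha = 500.
before = a1_schedule("c", 100.0, 5)
after = a1_schedule("c", 600.0, 5)
```

With L = 5 the switch in case (c) lowers the ratio $a_1/a_2$ from 1/5 to 1/10, and with L = 10 from 1/5 to 1/20.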

We use random initial conditions with the same normalization for the teacher and pupil weights,
$B^2/N = J^2/N = 1$. The results are shown in Fig. 1. In agreement with the preliminary results of
[?] we find no convergence in the case (a) and convergence in the case (b). If a certain threshold in
$\epsilon_g$ is achieved, switching to a smaller ratio $a_1/a_2$ is seen to accelerate the asymptotic
convergence - case (c) -, but even then $a_1$ cannot be set to zero - case (d). Similar behaviour is
observed for other N and L > 5.

This intriguing behaviour incited us to try to obtain analytic understanding by using the 
coarse grained analysis discussed in the next section. 

4 Coarse grained analysis 

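We combine blind association (3) during a learning period of L elementary steps and the graded
unspecific reinforcement (4) at the end of each learning period into one coarse grained step

$$J^{(q+1,1)} = J^{(q,1)} + \frac{1}{\sqrt{N}}\,(a_1 - a_2 e_q) \sum_{l=1}^{L} \mathrm{sign}(J^{(q,l)} \cdot \xi^{(q,l)})\, \xi^{(q,l)} , \qquad (13)$$

$$e_q = \frac{1}{2L} \sum_{l=1}^{L} \left| \mathrm{sign}(B \cdot \xi^{(q,l)}) - \mathrm{sign}(J^{(q,l)} \cdot \xi^{(q,l)}) \right| . \qquad (14)$$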
We introduce the notations:

$$R(q,l) = \frac{1}{N}\, B \cdot J^{(q,l)} , \qquad Q(q,l) = \frac{1}{N}\, [J^{(q,l)}]^2 \qquad (15)$$

and we normalize the teacher weights to 1, i.e. $B^2/N = 1$. In the "thermodynamic limit"
$L/N \to 0$ one can treat $\alpha$ as a continuous variable. We shall follow standard procedures
[10], [11], [12], [13] and obtain the following expressions for the changes of R and Q over a
coarse grained step:



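$$\frac{dR}{d\alpha} = \frac{1}{L\sqrt{N}}\,(a_1 - a_2 e_q) \sum_{l=1}^{L} \mathrm{sign}(J^{(q,l)} \cdot \xi^{(q,l)})\,(B \cdot \xi^{(q,l)}) , \qquad (16)$$

$$\frac{dQ}{d\alpha} = \frac{2}{L\sqrt{N}}\,(a_1 - a_2 e_q) \sum_{l=1}^{L} \mathrm{sign}(J^{(q,l)} \cdot \xi^{(q,l)})\,(J^{(q,l)} \cdot \xi^{(q,l)}) + \frac{1}{LN}\,(a_1 - a_2 e_q)^2 \left( \sum_{l=1}^{L} \mathrm{sign}(J^{(q,l)} \cdot \xi^{(q,l)})\, \xi^{(q,l)} \right)^2 . \qquad (17)$$

In the following we shall consider unbiased random input-patterns with

$$\langle \xi_i^{(q,l)} \xi_j^{(r,k)} \rangle = \delta_{ij}\, \delta_{lk}\, \delta_{qr} . \qquad (18)$$

The local fields

$$h_J^{(q,l)} = \frac{1}{\sqrt{N}}\, J^{(q,l)} \cdot \xi^{(q,l)} , \qquad h_B^{(q,l)} = \frac{1}{\sqrt{N}}\, B \cdot \xi^{(q,l)} \qquad (19)$$

are then normally distributed with second moments

$$\langle (h_J^{(q,l)})^2 \rangle = \langle Q(q,l) \rangle = Q , \qquad \langle (h_B^{(q,l)})^2 \rangle = 1 , \qquad \langle h_J^{(q,l)} h_B^{(q,l)} \rangle = \langle R(q,l) \rangle = R . \qquad (20)$$

Their joint probability density is thus given by

$$p(h_J, h_B) = \frac{1}{2\pi\sqrt{\Delta}} \exp\left( -\frac{1}{2\Delta} \left( Q h_B^2 - 2 R h_J h_B + h_J^2 \right) \right) , \qquad (21)$$

with

$$\Delta = Q - R^2 . \qquad (22)$$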
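The Gaussian statistics of the local fields can be checked by direct sampling. The sketch below draws pairs $(h_J, h_B)$ with second moments Q, 1 and R and compares two sample averages with their standard closed forms for correlated Gaussian variables: the probability that pupil and teacher disagree, $\frac{1}{\pi}\arccos(R/\sqrt{Q})$, and the average $\langle \mathrm{sign}(h_J) h_B \rangle = \sqrt{2/\pi}\, R/\sqrt{Q}$. The function name `sample_fields` and the numerical values Q = 4, R = 1.2 are illustrative assumptions.

```python
import numpy as np


def sample_fields(Q, R, n, rng):
    """Draw n pairs (h_J, h_B) with <h_J^2> = Q, <h_B^2> = 1, <h_J h_B> = R."""
    h_B = rng.standard_normal(n)
    z = rng.standard_normal(n)                 # independent of h_B
    h_J = R * h_B + np.sqrt(Q - R**2) * z      # Delta = Q - R^2 is the residual variance
    return h_J, h_B


rng = np.random.default_rng(0)
Q, R = 4.0, 1.2
h_J, h_B = sample_fields(Q, R, 400_000, rng)

# Fraction of patterns on which pupil and teacher give opposite answers.
disagree = float(np.mean(h_J * h_B < 0))
eps_g = float(np.arccos(R / np.sqrt(Q)) / np.pi)

# Average driving the teacher overlap in a Hebb-type update.
drive = float(np.mean(np.sign(h_J) * h_B))
drive_exact = float(np.sqrt(2.0 / np.pi) * R / np.sqrt(Q))
```

The decomposition $h_J = R\,h_B + \sqrt{\Delta}\,z$ used here is simply a convenient way of sampling from a bivariate normal density with the stated second moments.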



In the thermodynamic limit $N \to \infty$, the self-overlap of the learner Q and its overlap R with the
teacher are self-averaging, so that their evolution equations (16), (17) can be directly rewritten
in terms of evolution equations for their averages. Moreover, these averages Q and R become
smooth functions on the $\alpha$-scale, so that we can neglect the dependence of R and Q on l in (21)
when used to perform averages on the right hand sides of (16) and (17), as it would only produce
O(1/N) corrections to the evolution equations, which become negligible as $N \to \infty$. One thus
obtains

(23)

where



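$$\rho = \frac{1}{\pi\sqrt{\Delta}} \int_{-\infty}^{\infty} dx\, e^{-\frac{1}{2}x^2}\, \mathrm{sign}(x) \int_{-Rx}^{\infty} dy\, e^{-\frac{1}{2\Delta}y^2} = 1 - \frac{2}{\pi} \arccos\left( \frac{R}{\sqrt{Q}} \right) . \qquad (25)$$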



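The generalization error is:

$$\epsilon_g = \frac{1}{\pi} \arccos\left( \frac{R}{\sqrt{Q}} \right) . \qquad (26)$$

We may formally eliminate one of the learning parameters by rescaling our quantities by the
parameter $a_2$:

$$R = \mathcal{R}\, a_2 , \qquad Q = \mathcal{Q}\, a_2^2 , \qquad \lambda = \frac{a_1}{a_2} . \qquad (27)$$

We then obtain:

$$\frac{d\epsilon_g}{d\alpha} = -\frac{1}{\pi L \sqrt{2\pi\mathcal{Q}}}\, \sin(\pi\epsilon_g) + \frac{1}{2\pi\mathcal{Q}}\, \mathrm{cotg}(\pi\epsilon_g) \left( \lambda^2 - \left( 2\lambda - \frac{1}{L} \right) \epsilon_g + \left( 1 - \frac{1}{L} \right) \epsilon_g^2 \right) , \qquad (28)$$

$$\frac{d\mathcal{Q}}{d\alpha} = \sqrt{\frac{2\mathcal{Q}}{\pi}} \left( 2\lambda - 2\left( 1 - \frac{1}{L} \right) \epsilon_g - \frac{1}{L} \left( 1 - \cos(\pi\epsilon_g) \right) \right) + \lambda^2 - \left( 2\lambda - \frac{1}{L} \right) \epsilon_g + \left( 1 - \frac{1}{L} \right) \epsilon_g^2 . \qquad (29)$$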



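To establish the asymptotic behaviour we look for solutions of the equations (28,29) in the
limit of small $\epsilon_g$, large $\mathcal{Q}$. To leading order (for $\lambda > 0$), these equations become:

$$\frac{d\epsilon_g}{d\alpha} = -\frac{\epsilon_g}{L\sqrt{2\pi\mathcal{Q}}} + \frac{\lambda^2}{2\pi^2 \epsilon_g \mathcal{Q}} , \qquad (30)$$

$$\frac{d\mathcal{Q}}{d\alpha} = 2\sqrt{\frac{2\mathcal{Q}}{\pi}}\, \lambda , \qquad (31)$$

which can be solved exactly to give:

$$\epsilon_g^2 = \frac{\lambda}{\sqrt{2}\, \pi^{3/2} \left( \frac{1}{\lambda L} - 1 \right)}\, \mathcal{Q}^{-1/2} + c_1\, \mathcal{Q}^{-\frac{1}{2\lambda L}} \quad \text{for } \lambda \neq \frac{1}{L} , \qquad (32)$$

$$\epsilon_g^2 = \left( \frac{\lambda}{2\sqrt{2}\, \pi^{3/2}} \ln \mathcal{Q} + c_2 \right) \mathcal{Q}^{-1/2} \quad \text{for } \lambda = \frac{1}{L} , \qquad (33)$$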



i.e. explicitly

(34)

(35)

(36)

asymptotically at large $\alpha$.

Figure 2: Evolution of the generalization error $\epsilon_g$ (vertical axis) and of $\mathcal{Q}$ (horizontal axis) at
L = 10 for various $\lambda$ and starting points $\mathcal{Q}(0)$.

We see that for $\lambda < 1/L$ we obtain asymptotically perfect generalization, the dominant term
exhibiting the usual power -1/2 (and, for L = 1, $\lambda$ = 0.5, also the usual coefficient [11]), while
for $\lambda > 1/L$ the second term in (32,34) dominates and ensures again perfect generalization but
with a different power law, $-1/(2\lambda L)$. For $\lambda = 1/L$ we obtain logarithmic corrections - see eq.
(33,35). Notice that these results hold also for L = 1.
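The asymptotic power laws can be checked numerically. The sketch below assumes that the leading-order flow for small $\epsilon_g$ and large rescaled self-overlap $\mathcal{Q}$ reads $d\epsilon_g/d\alpha = -\epsilon_g/(L\sqrt{2\pi\mathcal{Q}}) + \lambda^2/(2\pi^2\epsilon_g\mathcal{Q})$ and $d\mathcal{Q}/d\alpha = 2\lambda\sqrt{2\mathcal{Q}/\pi}$; this form, the function names and the chosen initial values are assumptions of the sketch. It integrates these two equations with a fourth-order Runge-Kutta scheme and compares the result with the closed-form solution $\epsilon_g^2 = K\mathcal{Q}^{-1/2} + c_1\mathcal{Q}^{-1/(2\lambda L)}$, $K = \lambda^2 L/(\sqrt{2}\pi^{3/2}(1-\lambda L))$, valid for $\lambda \neq 1/L$.

```python
import math


def flow(eps, Q, lam, L):
    """Assumed leading-order right-hand sides for small eps and large Q."""
    d_eps = (-eps / (L * math.sqrt(2.0 * math.pi * Q))
             + lam**2 / (2.0 * math.pi**2 * eps * Q))
    d_Q = 2.0 * lam * math.sqrt(2.0 * Q / math.pi)
    return d_eps, d_Q


def integrate(eps0, Q0, lam, L, alpha_max, h=0.1):
    """Classical fourth-order Runge-Kutta integration from alpha = 0 to alpha_max."""
    eps, Q = eps0, Q0
    for _ in range(int(round(alpha_max / h))):
        k1e, k1q = flow(eps, Q, lam, L)
        k2e, k2q = flow(eps + 0.5 * h * k1e, Q + 0.5 * h * k1q, lam, L)
        k3e, k3q = flow(eps + 0.5 * h * k2e, Q + 0.5 * h * k2q, lam, L)
        k4e, k4q = flow(eps + h * k3e, Q + h * k3q, lam, L)
        eps += h * (k1e + 2.0 * k2e + 2.0 * k3e + k4e) / 6.0
        Q += h * (k1q + 2.0 * k2q + 2.0 * k3q + k4q) / 6.0
    return eps, Q


def closed_form_eps(Q, eps0, Q0, lam, L):
    """eps_g as a function of Q for lam != 1/L, with c_1 fixed by the initial point."""
    K = lam**2 * L / (math.sqrt(2.0) * math.pi**1.5 * (1.0 - lam * L))
    x = 1.0 / (2.0 * lam * L)
    c1 = (eps0**2 - K / math.sqrt(Q0)) * Q0**x
    return math.sqrt(K / math.sqrt(Q) + c1 * Q ** (-x))


# Example with lam * L = 0.5 < 1, i.e. the regime with the usual power -1/2.
eps_num, Q_num = integrate(0.1, 100.0, 0.1, 5, 1000.0)
eps_exact = closed_form_eps(Q_num, 0.1, 100.0, 0.1, 5)
```

Under the same assumption $\sqrt{\mathcal{Q}}$ grows linearly in $\alpha$, so that for $\lambda L < 1$ the first term dominates at large $\alpha$ and gives $\epsilon_g \simeq \sqrt{\lambda L/(2\pi(1-\lambda L)\alpha)}$, which for L = 1, $\lambda$ = 0.5 reduces to the familiar Hebbian result $1/\sqrt{2\pi\alpha}$.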

In the case $\lambda = 0$ one can see from (25,29) that starting with any finite $\mathcal{Q}$ one cannot have
perfect generalization for L > 1. For L = 1 one reobtains the asymptotic behavior found in [10].

There is, however, a nontrivial pre-asymptotic region, which turns out to be dominated by
two stationarity conditions, one for the self-overlap, $d\mathcal{Q}/d\alpha = 0$, and one for the overlap with the
teacher-configuration, $d\mathcal{R}/d\alpha = 0$ or, alternatively, that for the generalization error, $d\epsilon_g/d\alpha = 0$.
For suitable values of the network parameters, the two stationarity conditions may simultaneously
be satisfied, leading to fixed points of the learning dynamics, one of these fully stable and with
poor generalization, the other partially stable.

To this pre-asymptotic region we shall now turn our attention and thereby also obtain further
specifications for the parameters. In Fig. 2 we show the evolution of $\epsilon_g$ and $\mathcal{Q}$ according to
eqs. (28,29), starting from $\epsilon_g(0) = 0.5$ and various $\mathcal{Q}(0) = \mathcal{Q}_0$.^2 The various trajectories are
parameterized by $\lambda$. In all cases there is a critical value $\lambda_c(\mathcal{Q}_0)$ which separates flows toward
a stationary state of poor generalization from flows toward perfect asymptotic generalization.
The fixed point in the $\mathcal{Q}$, $\epsilon_g$ plane (with a location parameterized by $\lambda$) which is responsible for
this behaviour has an attractive and a repulsive direction. For a given initial condition $\mathcal{Q}_0$, the
critical value $\lambda_c(\mathcal{Q}_0)$ is defined as that value for which the attractive manifold connects the initial
condition to the partially stable fixed point; for smaller values of $\lambda$ the flow always is from the
initial condition to the fully stable fixed point with poor generalization, for slightly larger values
of $\lambda$ the flow is towards asymptotically perfect generalization. At still larger values of $\lambda$ the two
fixed points eventually coalesce and disappear altogether. Then we always have asymptotically
perfect generalization. Some values for $\lambda_c(\mathcal{Q}_0)$ are given in Table 1.

Figure 3: Flow in the plane $\epsilon_g$, $\mathcal{Q}$. The dot-dashed line corresponds to the stationarity condition
$d\epsilon_g/d\alpha = 0$, the dashed line to $d\mathcal{Q}/d\alpha = 0$. Left: L = 10, $\lambda$ = 0.075; two fixed points (one all
stable, one partially stable) clearly show up, the full lines represent the stable and the unstable
manifolds of the partially stable fixed point. Right: L = 5, $\lambda$ = 0.2, a parameter setting for which
there are no fixed points. For every starting point we have convergence to perfect generalization.
Compare also with Fig. 2.
$\lambda_c(\mathcal{Q}_0)$    0.2545(5)    0.2185(5)    0.1385(5)    0.0875(5)    0.0485(5)

Table 1: Critical value of $\lambda$ for L = 10 and various initial conditions.



^2 Notice that due to (27) the dependence on the initial conditions $\mathcal{Q}_0$ may be translated into a dependence on the
learning rate for the initial network: for a fixed ratio $\lambda$ of learning rates, and given values of the original overlaps
Q and R, finer updating (smaller $a_1$ and $a_2$) is equivalent to larger values of rescaled overlaps, hence a larger value
of $\mathcal{Q}_0$.



In Fig. 3 we describe the flow in this plane for a given $\lambda$; this should be compared with the
$\alpha$-trajectories in the $\mathcal{Q}$, $\epsilon_g$ plane for various $\lambda$ with different starting points $\mathcal{Q}_0$, Fig. 2. In Fig. 4
we plot directly $\epsilon_g(\alpha)$. As can be seen from all these figures, for $\lambda < \lambda_c$ the training leads to an
initial improvement which is however limited and followed by a very rapid deterioration toward
confusion. For $\lambda > \lambda_c$, on the contrary, the learning stabilizes and leads to asymptotically perfect
generalization with a $\lambda$-dependent power law in agreement with eqs. (34,35).

These analytic results compare very well with the numerical results given in the previous
section, both in the pre-asymptotic and in the asymptotic region (cf. Fig. 1).
Figure 4: Generalization error $\epsilon_g$ vs. $\alpha$ at L = 10 for various $\lambda$ and for starting point
$\mathcal{Q}(0) = 1000$. The straight lines on the left plot show the dominant asymptotic behaviour for the
corresponding $\lambda$ from (34,35) (notice that for $\lambda < 1/L$ the normalization is fixed; for $\lambda < 1/L$
we give also a fit using the subdominant terms in (34,35)). Right: Amplified view at the
pre-asymptotic region.



5 Summary and Discussion 

In the present paper we have investigated a two-phase learning algorithm for perceptrons, named 
AR-Hebb-algorithm. Its first phase consists of a series of Hebb-type synaptic modifications, 
correlating, however, input and self-computed output (blind association) rather than input and 
clamped teacher output. This first phase is followed by an unspecific but graded reinforcement- 
type learning step which leads to a partial reversal of the previous series of Hebb-type synaptic 
modifications, depending on current average success rates. 

Our main motivation has been biological, attempting to honour the observation that a
learner's control over its neurons and synapses might be less specific and direct than ordinary
supervised learning algorithms usually presume, while basically adhering to the Hebbian learning
paradigm.

Our central results can be stated as follows: 

a) Despite the fact that feedback on the learner's performance enters its learning dynamics 
only in an unspecific way in that it cannot be associated with single identifiable correct or 
incorrect associations, convergence of the AR-Hebb-algorithm in the sense of asymptotically 
perfect generalization is observed. 

b) For given initial conditions, this convergence depends on the parameters of the algorithm; in 
particular none of these parameters can be set to 0. Alternatively, at fixed L and ratio of the 
algorithm parameters convergence may depend on initial conditions. 

In its details the dynamics of this algorithm was found to be unexpectedly complex. Depending
on the parameters, fixed points in the dynamic flow may emerge, one stable, the other only
partially stable. The attracting manifold of the latter constitutes a separatrix dividing initial
states into two sets, one for which the algorithm converges, and another for which it does not, in
which case the flow is driven to the all-stable fixed point with poor generalization. Seen from a
different point of view, a given initial condition (given updating speed) may be found to belong
to the asymptotically converging lot, or to end up in a state of poor generalization, depending
on network parameters.

On the other hand, parameter settings may be varied in such a way that the two fixed points
eventually coalesce and disappear, rendering convergence of the algorithm independent of initial
conditions. The pre-asymptotic regime of the learning process is then still influenced by the lines
in the $\epsilon_g$-$\mathcal{Q}$ plane along which either $d\epsilon_g/d\alpha$ or $d\mathcal{Q}/d\alpha$ (but not both) vanish.

Much to our surprise, the convergence-rate of the algorithm was found to depend in a
non-universal manner on the ratio of learning parameters. In spite of the non-specific nature of the
information-feedback on the learning dynamics, convergence can be as fast as that of Hebbian
learning, $\epsilon_g \sim \alpha^{-1/2}$, if $\lambda L < 1$, whereas it is slower and exhibits a non-universal parameter
dependent rate, $\epsilon_g \sim \alpha^{-1/(2\lambda L)}$, if $\lambda L > 1$. Logarithmic corrections appear in the marginal case
$\lambda L = 1$.

One may ask why there is no generalization for a perceptron-type algorithm, $\lambda = 0$
(i.e., $a_1 = 0$). We can offer a simple observation which may be of heuristic value: since for L = 1
$e_q$ can only be 0 or 1, $a_1 = 0$ means penalty for failure, no change for success, i.e. the usual
perceptron learning rule known to converge. However, for L > 1 $e_q$ can take fractional values in
the interval [0, 1]. In this case $a_1 = 0$ means penalty for all answers which are short of perfect,
i.e. even if the pupil is successful in far above 50% of the cases. This procedure can turn out to
be destructive.

To put our findings into a broader perspective, it is perhaps appropriate to note that a similar 
kind of unspecific information feedback as in our setup occurs in committee-machine learning. 
While in our case, information feedback is unspecific in time (with respect to the pattern la- 
bels within a longer series on which the learner may have been in error), unspecificity in the 
committee-machine refers to space, i.e., the label(s) of the node(s) which may have contributed 
to a wrong output upon presentation of a single pattern. In the details, though, the way in 
which unspecific feedback is utilized in the dynamics is different in the two setups, leading to 
different asymptotic laws, and to different behaviour in the preasymptotic regime. Although 
plateaus in the learning dynamics occur in both setups, this similarity is superficial. Whereas
in the committee-machine, the appearance of plateaus is related to a permutation symmetry of
the nodes and escape therefrom to its breaking (a transition to specialization), there is strictly
speaking no time-translation symmetry within a coarse-grained step, and no breaking thereof, as
each coarse-grained step constitutes a whole correlated path of events during which the learner
already evolves in response to the patterns presented. Quantitatively the difference manifests
itself in the fact that plateaus in our setup have a much higher generalization error than those
in the committee-machines, and that the AR-Hebb rule may converge to a state of poor
generalization even if its initial performance is almost perfect (as can be seen in Fig. 3). Still,
it may be interesting to enquire whether techniques akin to those invented in order to decrease
the extent of plateaus in committee-machine learning (see [14] for a recent reference) might be
utilized to improve the present setup.

We have not up to now addressed issues related to optimal parameter settings or optimal 
online-control of parameters (the latter issue would in some sense run against our original bio- 
logically minded starting point), nor did we so far investigate the performance of the algorithm 
in multi-layer architectures. Clearly these may be interesting topics to pursue in future research, 
as may be more detailed investigations of the algorithm as an intricate dynamical system per se. 

Note added in proof: We should like to add the following interesting observation. A 
variant of the present algorithm which introduces an additional biologically motivated element 
of indeterminism by including patterns in the second (reinforcement) phase of a session only with 
probability p < 1 (the student does not remember everything it did in the first phase) shows 
qualitatively the same behaviour as the algorithm studied in the present paper. A rough first 
quantitative characterization of this modification would be that it leads to an effective rescaling of 
the parameter $a_2$ of the algorithm by approximately a factor p, entailing corresponding rescalings
of the parameter $\lambda$ and the scaled self-overlap $\mathcal{Q}$, viz. $\lambda \to \lambda/p$ and $\mathcal{Q} \to \mathcal{Q}/p^2$. Asymptotic
convergence will then be slower, with an exponent computed from the rescaled value $\lambda/p$ rather
than from the bare $\lambda$. It also leads to a corresponding reduction of critical $\lambda$'s for given initial
condition $\mathcal{Q}_0$ or, alternatively, to a reduction of the minimum $\mathcal{Q}_0$ required for convergence at a
given bare $\lambda$. These results are well corroborated by numerical simulations.